In this document you will find guidance on how to use various
R libraries and functions for estimating true disease
prevalence and diagnostic test accuracy in a Bayesian framework,
along with various exercises. We assume that you have some basic
R skills. If you have not worked with R before,
or feel a bit rusty, here are some resources to help you prepare:
Chapters 1, 2, and 4 of the CodeCademy “Learn R” course will provide a good overview of the basic concepts required for this workshop.
If you are familiar with R and want to do some
further reading, Hadley Wickham’s “R
for Data Science” is a great resource.
Remember, there are often many different ways to conduct a given
piece of work in R. Throughout this document, we tried to
stick with the simpler approaches (e.g., the shortest code, the minimal
number of R libraries).
Throughout the document, you will find examples of R
code along with comments. The R code always
appears in grey boxes (see the following example). This is the
code that you will be able to use for your own analyses. Lines
preceded by a # sign are comments; R skips them
when reading the code. Following each grey box with
R code, a white box with the results from the analysis
is presented.
For instance, here is some R code where I simply ask
R to show the main descriptive statistics for the speed variable of
the cars dataset (note that this dataset is already part of
R).
#This is a comment. R will ignore this line
#The summary() function can be used to ask for a summary of various R objects. For a variable (here the variable speed), it shows descriptive statistics.
summary(cars$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
Throughout the document, in the text I will use:
- Italics for datasets or variables. For instance, the speed
variable of the dataset cars.
- Shaded boxes for R libraries (e.g.,
episensr) and functions (e.g., summary()).
In R, we can first call a given library and then use the
functions from that library, or we can name the library followed by
:: and then the function. For instance, the two following
pieces of code are equivalent:
library(ggplot2)
ggplot(data=cars, mapping=aes(x=speed))+
geom_histogram()
##OR##
ggplot2::ggplot(data=cars, mapping=ggplot2::aes(x=speed))+
ggplot2::geom_histogram()
The latter may improve reproducibility, but at the expense of longer code. Throughout the document, we will always first call the library and then run the functions, to keep the code shorter.
One last thing: when using a given function, it is not mandatory to
name all the arguments, as long as they are presented in the sequence
expected by that function. For instance, the ggplot()
function that we used in the previous chunk of code expects
first a dataset (data=) and then a mapping attribute
(mapping=) and, within that mapping attribute, an x variable
(x=). We could shorten the code by omitting all of these argument names.
The two following pieces of code are, therefore, equivalent:
library(ggplot2)
ggplot(data=cars, mapping=aes(x=speed))+
geom_histogram()
##OR##
library(ggplot2)
ggplot(cars, aes(speed))+
geom_histogram()
Throughout the document, however, we will use the longer code with all the arguments named. Since you are learning these new functions, it would be quite a challenge to use the shorter code right from the start, but you could certainly adopt the shorter style later on.
LET’S GET STARTED!
When using latent class models (LCM) to estimate disease prevalence or diagnostic test accuracy, we will need to use various distributions. There are two situations where we will use them: (1) to describe our prior knowledge about unknown parameters, and (2) as part of the likelihood function linking observed data to unknown parameters. Some distributions will more often be used for the first objective, others mainly for the second. In the following sections, we will cover the most useful distributions.
One of the first things that you will need for any Bayesian analysis is a way to generate and visualize a prior distribution corresponding to the scientific knowledge that you have about an unknown parameter. Generally, you will use information such as the mean and variance, or the mode and the 2.5th (or 97.5th) percentile, to find a corresponding distribution. Here is a list of the most important distributions; we will give more details on each of them in the following sections:
- Normal distribution: defined by its mean (mu) and
its standard deviation (SD) or variance (tau). In some packages and
software (rjags and JAGS, for instance), the precision, i.e., the inverse of the
variance (1/tau), is used to specify a given normal distribution.
Notation: dnorm(mu, 1/tau)
- Uniform distribution: defined by a minimum (min)
and a maximum (max).
Notation: dunif(min, max)
- Beta distribution: bounded between 0 and 1, beta
distributions are defined by two shape parameters, a and b.
Notation: dbeta(a, b)
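Since rjags and JAGS expect the precision (the inverse of the variance) rather than the SD, converting between the two is a small but frequent task. Here is a minimal base-R sketch (the numbers are arbitrary examples):

```r
# Convert a prior expressed as a mean and SD into the precision
# (1/variance) parameterization expected by JAGS' dnorm()
prior.mean <- 2.0            # arbitrary example values
prior.sd   <- 0.5
prior.prec <- 1/prior.sd^2   # precision = 1/variance
prior.prec
# In a JAGS model, this prior would be written: theta ~ dnorm(2.0, 4)
```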
The dnorm() function can be used to generate a given
Normal distribution and the curve() function can be used to
visualize the generated distribution. These functions are already part
of R; there is no need to load an R
package.
curve(dnorm(x, mean=2.0, sd=0.5), # I indicate mean and SD of the distribution
from=-2, to=7, # I indicate limits for the plot
main="Normal distribution with mean of 2.0 and SD of 0.5", #Adding a title
xlab = "Value", ylab = "Density") #Adding titles for axes
Density curve of a Normal distribution.
Note that a Normal distribution with a mean of zero and a very large SD provides very little information. Such a distribution is essentially flat and would be referred to as a flat (a.k.a. vague) distribution.
curve(dnorm(x, mean=0.0, sd=10000000000),
from=-100, to=100,
main="A flat Normal distribution",
xlab = "Value", ylab = "Density")
Density curve of a flat Normal distribution.
In the same manner, we could visualize a uniform distribution using
the dunif() function. In the following example, we assume
that any value between -5.0 and 5.0 is equally probable.
curve(dunif(x, min=-5.0, max=5.0),
from=-10, to=10,
main="Uniform distribution with -5.0 and 5.0 limits",
xlab = "Value", ylab = "Density")
Density curve of a Uniform distribution.
Beta distributions are another type of distribution that will
specifically be used for parameters that are proportions (i.e., bounded
between 0.0 and 1.0). For this specific workshop, they will be very
handy, since a sensitivity, a specificity, or a prevalence are all
proportions. The epi.betabuster() function from the
epiR library can be used to define a given prior
distribution based on previous knowledge. When you use the
epi.betabuster() function, it creates a new R
object containing different elements. Among these, one element will be
named shape1 and another shape2. These correspond to
the a and b shape parameters of the corresponding Beta distribution.
For instance, we may know that the most likely value for the sensitivity of a given test is 0.85 and that we are 97.5% certain that it is greater than 0.75. With these values, we can find the a and b shape parameters of the corresponding Beta distribution.
library(epiR)
# Sensitivity of a test as Mode=0.85, and we are 97.5% sure >0.75
rval <- epi.betabuster(mode=0.85, conf=0.975, greaterthan=T, x=0.75) # I create a new object named rval
rval$shape1 #View the a shape parameter in rval
## [1] 62.661
rval$shape2 #View the b shape parameter in rval
## [1] 11.88135
#plot the prior distribution
curve(dbeta(x, shape1=rval$shape1, shape2=rval$shape2), from=0, to=1,
main="Prior for test's sensitivity", xlab = "Proportion", ylab = "Density")
Density curve of a Beta distribution for a test sensitivity.
Note that a dbeta(1.0, 1.0) is a uniform beta distribution.
#plot the prior distribution
curve(dbeta(x, shape1=1.0, shape2=1.0), from=0, to=1,
main="A Beta(1.0, 1.0) or flat distribution", xlab = "Proportion", ylab = "Density")
Density curve of a Beta(1.0, 1.0) distribution.
In many situations, a distribution will be used as part of the likelihood function to link observed data to unknown parameters. The ones we will most frequently use here are:
- Binomial distribution: for the number of individuals
with a value of 1 among n observations of a variable that can take
the value 0 or 1. Binomial distributions are defined by a probability,
P, which is the probability that the variable takes the value 1, and a
number of “trials”, n.
Notation: dbin(P, n)
- Multinomial distribution: made up of a number of
probabilities, all bounded between 0 and 1, that together sum
up to 1. Multinomial distributions are defined by k probabilities (P1,
P2, …, Pk) and a number of observations (n).
Notation: dmulti(P[1:k], n)
If we have a variable that can take only two values, healthy or diseased (0 or 1), we can estimate an unknown parameter such as the proportion (P) of diseased individuals (i.e., the disease prevalence) based on the observed data (a number of positive individuals, T, AND a total number of individuals, n). For instance, if we had 30 diseased (T=30) out of 100 tested individuals (n=100), we could estimate the unknown parameter P using this very simple likelihood function:
\(T \sim dbin(P, n)\)
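To build some intuition for this likelihood function, we can evaluate it at a few candidate values of P with base R's dbinom() function, using the T = 30 out of n = 100 example; the candidate closest to the observed proportion (0.30) yields the highest likelihood:

```r
# Likelihood of observing T=30 positives out of n=100
# for three candidate prevalence values
cand <- c(0.10, 0.30, 0.50)
lik <- dbinom(x=30, size=100, prob=cand)
names(lik) <- cand
lik   # the likelihood peaks at P=0.30
```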
A last type of distribution that we will use in our LCM is the multinomial distribution. When an outcome is categorical with >2 categories, we can use a multinomial distribution to describe the probability that an individual has the value “A”, “B”, “C”, etc. In our context, the main application of this distribution will be describing the combined results of two (or more) diagnostic tests. For instance, if we cross-tabulate the results of Test A and Test B, we have four potential outcomes:
- Test A+ and Test B+ (let's call n1 the number of individuals in that
cell and P1 the probability of falling in that cell)
- Test A+ and Test B- (n2 and P2)
- Test A- and Test B+ (n3 and P3)
- Test A- and Test B- (n4 and P4)
We can illustrate this as follows:
| | Test A+ | Test A- |
|---|---|---|
| Test B+ | n1 | n3 |
| Test B- | n2 | n4 |
Thus, we could describe the probabilities (P1 to P4) of an individual falling into one of the cells of this 2x2 table using a multinomial distribution. In this specific case, we would say that the combined results of the 2 tests (n1 to n4) and the total number of individuals tested (n), which are the observed data, are linked as follows to the 4 probabilities (P1 to P4; the unknown parameters):
\(n[1:4] \sim dmulti(P[1:4], n)\)
Which means: the value of n1 (or n2, n3, or n4) is determined by the probability of falling into the “Test A+ and Test B+” cell, which is P1 (or P2, P3, or P4 for the other cells), and by the total number of individuals tested (n). Nothing too surprising here… if I have a probability of 0.30 of falling in a given cell and I have tested 100 individuals, I should find 30 individuals in that cell. Wow! Still, the multinomial distribution is nice because it ensures that all our probabilities (P1 to P4) sum up to exactly 1.0.
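As a quick illustration of this behavior, base R's rmultinom() function can simulate cell counts (n1 to n4) from a multinomial distribution; the cell probabilities below are made up for the example, and each simulated set of counts always sums exactly to n:

```r
# Simulate the four cell counts of a 2x2 table for 497 tested individuals,
# using arbitrary cell probabilities that sum to 1.0
probs <- c(0.50, 0.05, 0.01, 0.44)   # P1 to P4 (made-up values)
set.seed(123)                        # for a reproducible draw
counts <- rmultinom(n=1, size=497, prob=probs)
counts        # n1 to n4
sum(counts)   # always 497
```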
Different software and R packages can be used to run
Bayesian models. Whether it is a LCM or a more conventional regression
model is not very important at this stage. In this document we will
mainly use a program called JAGS and the R2jags package.
So, if not already done, you will have to download JAGS here and
install it on your computer. Once installed, we will be able to use
R2jags to operate JAGS through R.
Basically, to run any Bayesian model, we will always need the same three things: the data, the model (likelihood function and priors), and initial values for the Markov chains.
This is the easy part. If we were to run, for instance, a
“conventional” regression analysis, we would have to import into
R a complete dataset (e.g., a CSV file) with a row for each
observation and a column for each variable. However, when using LCM to
estimate a disease prevalence or diagnostic accuracy, the dataset is
simply made up of the number of individuals in each category. For
instance, for estimating a simple prevalence (Prev), we could
have 30 individuals that tested positive (variable T) out of a
total of 100 individuals (variable n). In such a case, the
dataset could be created as follows:
#Creating a dataset
datalist <- list(T=30, n=100)
If you want to check that it works, you could ask to see this new
R object.
datalist
## $T
## [1] 30
##
## $n
## [1] 100
Let's start with a simple example: estimating a single proportion. For estimating a proportion, we could describe the likelihood function linking the unknown parameter to the observed data as follows:
\(T \sim dbin(Prev, n)\)
In other words: the disease prevalence (Prev), an unknown parameter, can be linked to the observed number of positive individuals (T) and the total number of tested individuals (n) through the binomial distribution.
In this case, the values of T and n are the observed data and Prev is the only unknown parameter.
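Incidentally, for this simple model with a Beta(a, b) prior on Prev, the posterior distribution is known in closed form: Beta(a + T, b + n - T). This gives a handy way to sanity-check the MCMC results later. A quick sketch with a flat Beta(1, 1) prior and our T = 30, n = 100 data:

```r
# Closed-form posterior for a binomial likelihood with a beta prior
a <- 1; b <- 1        # flat Beta(1, 1) prior
T <- 30; n <- 100     # observed data
post.a <- a + T       # posterior a shape parameter
post.b <- b + n - T   # posterior b shape parameter
# Posterior 2.5th percentile, median, and 97.5th percentile
qbeta(c(0.025, 0.5, 0.975), shape1=post.a, shape2=post.b)
```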
We will have to specify a prior distribution for each unknown parameter (using the \(\sim\) sign).
In the preceding example, we only have one unknown parameter,
Prev, which is a proportion. Theoretically, such a parameter could
take values from 0 to 1, so a beta distribution would be one way to
describe its prior distribution. We could use the
epi.betabuster() function from the epiR
library to define a given beta distribution based on previous knowledge,
as described in the preceding sections. Moreover, if we want to use a
flat prior, we can simply choose the value 1.0 for both the a and b shape
parameters.
In the following code, I create R objects for the a and
b shape parameters of the beta prior distribution of Prev.
Prev.shapea <- 1 #a shape parameter for Prev
Prev.shapeb <- 1 #b shape parameter for Prev
With informative priors, for instance, if we know from the literature that the prevalence is expected to be 0.25 with a 95% CI of (0.20, 0.30), we could instead use the following code and assign the generated values to Prev.shapea and Prev.shapeb:
library(epiR)
# Prevalence as Mode=0.25, and we are 97.5% sure >0.20
rval <- epi.betabuster(mode=0.25, conf=0.975, greaterthan=T, x=0.20) # I create a new object named rval
rval$shape1 #View the a shape parameter in rval
## [1] 62.356
rval$shape2 #View the b shape parameter in rval
## [1] 185.068
#I left them as comments, but the two lines below would be used to assign the generated values for later usage
#Prev.shapea <- rval$shape1
#Prev.shapeb <- rval$shape2
Our Bayesian model has to be presented to JAGS as a text file (i.e.,
a .txt document). To create this text file, we will use the
paste0() function to write down the model and then the
write.table() function to save it as a .txt file.
An interesting feature of the paste0() function
is the possibility to include R objects in the
text. For instance, you may want to have a general model that
will not be modified, but change the values used to describe the
priors. Within your paste0() function you could call these
“external” values. Thus, you just have to change the value of the
R object that your model is using, rather than rewriting a
new model. For instance, we just created two R
objects called Prev.shapea and Prev.shapeb (which were
the a and b shape parameters for the prior distribution of our
proportion Prev). Ignoring, for now, this previous work, we
could directly create a short text as follows:
text1 <- paste0("Prev ~ dbeta(1, 1)") #Creating a short text named "text1"
text1 #Asking to see the text1 object
## [1] "Prev ~ dbeta(1, 1)"
However, if the same model is to be applied later with a different
prior distribution, it may be more convenient to leave the a and b shape
parameters as R objects. Thus, if we change the prior
distribution and run the script again, the entire analysis will be run
with the new distribution.
text2 <- paste0("Prev ~ dbeta(",Prev.shapea,", ",Prev.shapeb,")") #Creating a short text named "text2" and containing other already created R objects
text2 #Asking to see the text2 object
## [1] "Prev ~ dbeta(1, 1)"
When using paste0(), anything appearing outside the
quotation marks and separated by commas (i.e., ", R_object, ") will be
treated as an R object whose value is inserted between
the quoted pieces of text. Thus, later, I could just
change the values of Prev.shapea and Prev.shapeb and
my model will be automatically modified.
Some important notes about the text file containing the model
before we start with R2jags:
JAGS is expecting to see the word “model” followed by curly brackets { }. The curly brackets will contain the likelihood function AND the priors.
Similarly to R, any line starting with a # will be
ignored (these are comments).
Similarly to R, <- is the assignment operator.
The ~ symbol means “follow a distribution”.
Similarly to R, JAGS is case sensitive (i.e.,
P does not equal p).
Loops can be used with the keyword “for” followed by curly brackets.
Arrays can be indexed using square brackets [ ].
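To see these syntax rules together, here is a small hypothetical model (not one we will run in this workshop) that estimates a separate prevalence for each of three herds, using a for loop and square-bracket indexing:

```r
# A hypothetical JAGS model illustrating the syntax rules above:
# comments, the ~ symbol, a for loop, and square-bracket indexing
demo_model <- paste0("model{
  for(h in 1:3){
    #=== LIKELIHOOD (one per herd) ===#
    T[h] ~ dbin(Prev[h], n[h])
    #=== PRIOR (one per herd) ===#
    Prev[h] ~ dbeta(1, 1)
  }
}")
```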
In the example below, I create an R object called
model_single_prop using the paste0() function and
then save it as a text file using the write.table()
function. We will check this text file below and then explain the code
for the model.
# Specify the model
model_single_prop <- paste0("model{
#=== LIKELIHOOD ===#
T ~ dbin(Prev, n)
#=== PRIOR ===#
Prev ~ dbeta(",Prev.shapea,", ",Prev.shapeb,") ## Prior for Prev
}")
#write as a text (.txt) file
write.table(model_single_prop,
file="model_single_prop.txt",
quote=FALSE,
sep="",
row.names=FALSE,
col.names=FALSE)
Here is a snapshot of the final result (i.e., the text file):
Text file with the model for estimating a single proportion.
The initial value is the first value where each Markov chain will
start.
- You have to specify one initial value for each unknown parameter in
the model.
- If you decide to run three Markov chains, you will need three
different sets of initial values.
For instance, in our preceding example we have one unknown parameter (Prev). If we were to run three Markov chains, we could generate three sets of initial values for this parameter as follows. Of course, the chosen values could be changed; in the case of a proportion, you can specify any value between 0 and 1.
#Initializing values for the parameter "Prev" for the 3 chains
inits <- list(
list(Prev=0.10),
list(Prev=0.50),
list(Prev=0.90)
)
We now have an R object that I have called
inits; it contains the values where each of the three Markov
chains will start.
We now have the three elements that we needed: data, a model, and a
set of initial values. We can now use the R2jags library to
run our Bayesian analysis. The R2jags package provides an
interface from R to the JAGS software for Bayesian data
analysis. JAGS uses Markov Chain Monte Carlo (MCMC) to generate a sample
from the posterior distribution of the parameters.
The R2jags function that we will use for that is the
jags() function. Here are the arguments that we need to
specify:
- data= : the R object containing the data.
- model.file= : the text file containing the model.
- parameters.to.save= : the unknown parameters to monitor.
- n.chains= : the number of Markov chains. The default is 3.
- inits= : the R object containing the initial values. If inits is NULL, JAGS will randomly generate values.
- n.iter= : the number of iterations per chain.
- n.burnin= : the number of iterations discarded at the beginning of each chain (the burn-in).
- DIC= : DIC=TRUE or DIC=FALSE will determine whether to compute the deviance, pD, and deviance information criterion (DIC). These could sometimes be useful for model selection/comparison. We can leave that to DIC=FALSE since we will not be using these values in the workshop.

With the following code, we will create an R object
called bug.out. This object will contain the results of
running our model, with the specified priors, using the available data
and the sets of initial values that we have created. For this analysis,
we ask for 3 Markov chains, each run for 6,000 iterations, with the
first 1,000 iterations of each chain discarded as burn-in. Finally, we will
monitor our only unknown parameter, Prev, and nothing else.
library(R2jags)
library(coda)
#Run the Bayesian model
bug.out <- jags(data=datalist, #Specifying the R object containing the data
model.file="model_single_prop.txt", #Specifying the .txt file containing the model
parameters.to.save=c("Prev"), #Specifying the parameters to monitor. When >1 parameter, it will be a list, ex: c("Prev", "Se", "Sp")
n.chains=3, #Number of chains
inits=inits, #Specifying the R object containing the initial values
n.iter=6000, #Specifying the number of iterations
n.burnin=1000, #Specifying the number of iterations to remove at the begining
n.thin=1, #Specifying the thinning rate
DIC=FALSE) #Indicating that we do not request the deviance, pD, nor DIC
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 4
##
## Initializing model
As you can see, we received a few messages. It looks good, don’t you think?
Before jumping to the results, you should first check:
- Whether the chains have converged,
- Whether the burn-in period was long enough,
- Whether you have a large enough sample of values to describe the
posterior distribution with sufficient precision,
- Whether the Markov chains behaved well (mainly their
auto-correlation).
There are many diagnostic methods available for Bayesian analyses, but, for this workshop, we will mainly use basic methods relying on plots.
Diagnostics are available when you convert your model output into an
MCMC object using the as.mcmc() function of the
mcmcplots library.
library(mcmcplots)
bug.mcmc <- as.mcmc(bug.out) #Converting the model output into a MCMC object (for diagnostic plots)
You can then use the mcmcplot() function on this newly
created MCMC object.
mcmcplot(bug.mcmc, title="Diagnostic plots") #Asking for the diagnostic plots
When you do that, an HTML document will be created and your web browser will automatically open it. Here is a snapshot of the HTML file in my browser:
Figure. Diagnostic plots.
To check whether convergence of the Markov chains was obtained you can inspect the “running mean” (right middle figure) and “trace plot” (the bottom figure) presenting the values that were picked by each chain on each iteration. The different chains that you used (three in the preceding example) will be overlaid on the trace plot. You should inspect the trace plot to evaluate whether the different chains all converged in the same area. If so, after a number of iterations they should be moving in a relatively restricted area and this area should be the same for all chains.
The plots above are a good example of the trace plot of three “nice” Markov chains (lucky us!). In this example, we can see on the running mean plot that the chains started from different initial values but very quickly converged (after fewer than 1,000 iterations) toward the same area. On the trace plot, we can see the values that were stored after the burn-in period. The chains were then moving freely within this limited area (thus sampling from the posterior distribution).
We could conclude based on the trace plot that:
- The Markov chains did converge,
- A short burn-in period of less than 1,000 iterations would have been
sufficient. Note that we will often use a longer burn-in (e.g., 5,000 or
even more) just to be sure, and because it only costs a few more seconds
on your computer…
Here is another example where two chains (the red and green)
converged in the same area, but a third one (the blue) converged
in a different area. We should be worried in such a
case. We will see later the reason for such behavior.
Finally, here is a last example with just one Markov chain, and it
clearly did not converge, even after 25,000 iterations. You
should be terrified by that!!! ;-).
Next, you can inspect the posterior distributions to determine if you have a large enough sample of independent values to describe with enough precision the median estimate and the 2.5th and 97.5th percentiles. The effective sample size (ESS) takes into account the number of values generated by the Markov chains AND whether these values were auto-correlated, to compute the number of “effective” independent values used to describe the posterior distribution. Auto-correlated chains will generate a smaller number of “effective” values than chains that are not auto-correlated.
The effectiveSize() function of the coda
library provides a way to appraise whether you ran enough iterations. In
the current example, we already created an MCMC object named
bug.mcmc. We can ask for the effective sample size as
follows:
effectiveSize(bug.mcmc) #Asking for the ESS
## Prev
## 8918.033
So we see here that, despite having stored 3 times 5,000 values (15,000 values), we have the equivalent of a bit more than 9,000 independent values. What to do with this information? Well, you could, for instance, decide on an arbitrary rule for the ESS (say 10,000 values) and then adapt the length of the chains to achieve such an ESS (for each of the monitored parameters, because the effective sample size will not be the same for all parameters).
Remember, a feature of Markov chains is to have some auto-correlation between two immediately subsequent iterations, and then a correlation that quickly goes down to zero (i.e., they have a short memory). The auto-correlation plot can be used to assess whether this feature is respected.
From our previous example (check the upper right figure in the plot
below), it seems that it is the case.
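If you prefer a numeric check over the plot, base R's acf() function returns the auto-correlation at increasing lags. As a self-contained sketch, here it is applied to a synthetic, well-mixed chain (independent random draws, not your model output); with a real analysis you would apply it to the stored chain values instead:

```r
# Auto-correlation of a synthetic, well-behaved "chain":
# independent draws should show near-zero auto-correlation at lags > 0
set.seed(42)
fake.chain <- rnorm(5000)
ac <- acf(fake.chain, lag.max=10, plot=FALSE)
round(ac$acf[1:5], 3)   # lag 0 is always 1; later lags should be near 0
```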
Here is another example that should be worrying.
When such a behavior is observed, running the chains for more iterations will help achieve the desired ESS. In some situations, specifying different priors (i.e., increasing the amount of information in the priors) may also help.
Now, if you are happy with the behavior of your Markov chains, the
number of iterations, and the burn-in period, you can summarize your results
(i.e., report the median, 2.5th, and 97.5th percentiles) very easily with
the print() function. Below, I have also indicated
digits.summary=3 to adjust the precision of the reported
estimates (the default is 1).
print(bug.out, #Asking to see the mean, median, etc of the unknown parameters that were monitored
digits.summary=3) #Requiring a precision of 3 digits
## Inference for Bugs model at "model_single_prop.txt", fit using jags,
## 3 chains, each with 6000 iterations (first 1000 discarded)
## n.sims = 15000 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
## Prev 0.305 0.045 0.221 0.273 0.303 0.334 0.398 1.001 6300
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
Here you see that the median Prev estimate (95% CI) was (after rounding) 0.30 (0.22, 0.40).
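These percentiles can also be computed by hand with quantile() applied to the stored samples. To keep this sketch self-contained, I use synthetic rbeta() draws as a stand-in; with the fitted model you would use bug.out$BUGSoutput$sims.list$Prev (described further below) instead:

```r
# With the fitted model, replace the stand-in below by:
#   post.samples <- bug.out$BUGSoutput$sims.list$Prev
set.seed(1)
post.samples <- rbeta(15000, 31, 71)   # synthetic stand-in for the stored draws
q <- quantile(post.samples, probs=c(0.025, 0.5, 0.975))
q   # 2.5th percentile, median, 97.5th percentile
```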
You can also plot the bug.mcmc object that you
have created using the plot() function. It produces
the trace plot and a density plot presenting the posterior distribution
of each unknown parameter.
plot(bug.mcmc) #Asking for the trace plot and density plot of the posterior distributions
Trace plot and density plot produced using the plot() function.
If you prefer to work on your own plots, note that all the values sampled from your posterior distributions were stored in the bug.out object that you have created. If you inspect the bug.out object, you will notice an element named BUGSoutput; inside this element, there is another object named sims.list; within this sims.list, there is an element named Prev. As you can see below, this element is made of a list of 15,000 values. This corresponds to the 5,000 values sampled (1/iteration) by each Markov chain (3 chains) for the unknown parameter Prev. So this element is simply all the sampled values assembled together.
sims.list object located in the bug.out output.
You could use this element, for instance, to plot the prior and posterior distributions together on the same figure. On the figure below, the prior distribution is not obvious; it is the flat dashed red line just above the x-axis (remember, we used a flat prior for Prev).
#Plot the posterior distribution
plot(density(x=bug.out$BUGSoutput$sims.list$Prev), #Indicating variable to plot
main="Disease prevalence", #Title for the plot
xlab="Prevalence", ylab="Density", #x and y axis titles
)
#plot the prior distribution
curve(dbeta(x, shape1=Prev.shapea, shape2=Prev.shapeb), from=0, to=1, #Plotting the corresponding prior distribution
lty=2, #Asking for a dashed line pattern
col="red", #Asking for a red line
add=TRUE) #Asking to add this plot to the preceding one
Prior (dashed red line) and posterior (full black line) distribution of the prevalence of disease.
The data used for the following exercises come from a study aiming to estimate the diagnostic accuracy of a milk pregnancy-associated glycoprotein (PAG) ELISA and of a transrectal ultrasonographic (US) exam when used to determine the pregnancy status of individual dairy cows 28 to 45 days post-insemination. In that study, the new test under investigation was the PAG test; US was used for comparison but was considered an imperfect reference test.
In the original study, data from 519 cows from 18 commercial dairy herds were used. For the following exercises, the dataset was modified so that we have data from 519 cows from 3 herds (i.e., the data from 16 herds were collapsed together so they appear to be from one large herd). The complete original publication can be found here: Dufour et al., 2017.
The dataset Preg.xlsx contains the results from the study. However, for the exercises we will always simply use the cross-tabulated data, and these will already be organized for you and presented at the beginning of each exercise. The dataset is still provided so you can further explore additional comparisons. The variables in the Preg.xlsx dataset are described in the table below.
Table. List of variables in the Preg.xlsx dataset.
| Variable | Description | Range |
|---|---|---|
| Obs | An observation unique ID number | 1 to 519 |
| Herd | Herd ID number | 1 to 3 |
| Cow | Cow ID number | NA |
| T1_DOP | Number of days since insemination at testing | 28 to 45 |
| PAG_DX | Results from the PAG test | 0 (not pregnant); 1 (pregnant) |
| US_DX | Results from the ultrasound test | 0 (not pregnant); 1 (pregnant) |
| TRE_DX | Results from the transrectal exam (TRE) test | 0 (not pregnant); 1 (pregnant) |
In the study, we had the following proportion of apparently pregnant cows (based on the US exam).
257 US+/497 exams
Open up the R project for Exercise 1 (i.e., the R
project file named Exercise 1 - Intro Bayesian.Rproj).
In the Question 1 folder, you will find a partially completed
R script named Question 1.R. To answer the
following questions, try to complete the missing parts (they are
highlighted by a #TO COMPLETE# comment). We also
provided complete R scripts, but try to work it out on your own
first!
Start from the partially completed Question 1.R
R script located in Question 1 folder.
What would be the proportion of pregnant cows? Use a Bayesian approach to compute that proportion and a credible interval (CI). For this first computation, assume that you have no prior knowledge on the expected proportion of pregnant cows in a Canadian herd. Run three Markov chains with different sets of initial values. Look at the trace plot. Do you think convergence was achieved? Do you need a longer burn-in period? Are all 3 chains converging in the same space? Compute the effective sample size (ESS); do you feel that the number of iterations was sufficient to describe the posterior distribution with enough precision? What about auto-correlation of the Markov chains, any issues there?
If you were to compute a frequentist estimate with a 95% CI, would it differ a lot from your Bayesian estimates? Why? As a refresher, the formula for a frequentist 95% CI for a proportion is below (where P is the observed proportion and n is the denominator for that proportion):
\(95\%\,CI = P \pm 1.96 \sqrt{\frac{P(1-P)}{n}}\)
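As a sketch, this formula can be evaluated directly in base R; here I plug in the 257 US+ out of 497 exams reported earlier:

```r
# Frequentist point estimate and 95% CI for a proportion
x <- 257; n <- 497        # US+ exams and total exams
P <- x/n                  # observed proportion
se <- sqrt(P*(1-P)/n)     # standard error of the proportion
ci <- c(estimate=P, lower=P-1.96*se, upper=P+1.96*se)
round(ci, 3)
```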
In the literature, you saw a recent study conducted on 30 Canadian herds reporting an expected pregnancy prevalence of 42% with a 97.5th percentile of 74%. What kind of distribution could you use to represent this information? Use this information as the prior distribution for the pregnancy prevalence. Are your results very different?
When comparing PAG to TUS results for the whole dataset, you got the following 2x2 table.
Table. Cross-classified results of the PAG and TUS tests.
| TUS+ | TUS- | Total | |
|---|---|---|---|
| PAG+ | 255 | 21 | 276 |
| PAG- | 2 | 219 | 221 |
| Total | 257 | 240 | 497 |
Assuming that TUS is a gold-standard test, could you compute PAG sensitivity (Se) and specificity (Sp)? Use vague priors for PAG Se and Sp, since this is the first ever study on this topic (i.e., you do not have any prior knowledge about these unknown parameters).
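Before setting up the Bayesian model, a quick frequentist cross-check can be read directly off the 2x2 table (point estimates only, not a substitute for the Bayesian analysis the question asks for):

```r
# Apparent Se and Sp of the PAG test, treating TUS as a perfect reference
Se <- 255/257   # PAG+ among the 257 TUS+ cows
Sp <- 219/240   # PAG- among the 240 TUS- cows
round(c(Se=Se, Sp=Sp), 3)
```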
Answer:
Answer:
Answer:
Answer: